Spoken Russian in the Russian National Corpus (RNC)
Abstract
The RNC is now a 120-million-word collection of Russian texts, which makes it the most representative and authoritative corpus of the Russian language. It is available on the Internet at www.ruscorpora.ru. The RNC contains texts of all genres and types, covering Russian from the 19th to the 21st century. The practice of constructing national corpora has shown that it is indispensable to include sub-corpora of spoken language in the RNC. Therefore, the constructors of the RNC intend to include in it about 10 million words of Spoken ...
Similar resources
Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus
In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to the model trained on the Russian National Corpus (RNC). The two corpora differ greatly in size and compilation procedures. We test these differences by evaluating the tra...
Identification of context markers for Russian nouns
The research project presented in this paper aims at identification of context markers for Russian nouns and their use in construction identification. The body of contexts has been extracted from the Russian National Corpus (RNC). The context processing procedure takes into account the lexical and semantic information represented in the corpus annotation. Merged meaning of words are taken into ...
Multimodal Russian Corpus (MURCO): First Steps
The paper introduces the Multimodal Russian Corpus (MURCO), which has been created in the framework of the Russian National Corpus (RNC). The MURCO provides users with a great amount of phonetic, orthoepic, and intonational information related to Russian. Moreover, the deeply annotated part of the MURCO contains data concerning Russian gesticulation, the speech act system, types of vocal gest...
Texts in, meaning out: neural language models in semantic similarity task for Russian
Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of Russian Semantic Similarity Evaluation track, where our models took from 2nd...
Disambiguation of Taxonomy Markers in Context: Russian Nouns
The paper presents experimental results on WSD, with focus on disambiguation of Russian nouns that refer to tangible objects and abstract notions. The body of contexts has been extracted from the Russian National Corpus (RNC). The tool used in our experiments is aimed at statistical processing and classification of noun contexts. The WSD procedure takes into account taxonomy markers of word mea...